A method for querying collated data sets



Background to the Invention 

Within many fields of information management, data about a particular entity is spread 
across a variety of different databases and data sources. 

As a generic example of data integration we now consider how a government agency
might make use of the content of the databases they hold about each one of us. We use
this example merely to highlight the power of data integration without delving into
specific examples in Life Sciences.

First, let us briefly review the sort of data available. Some may be held directly by the
government agencies and other information may be held by commercial organisations
with some access from government agencies.

Government and local authority records:
• Electoral Register
• Income Tax
• National Insurance Records
• Companies House
• Driving and vehicle licences
• TV licences
• Immigration, passports and visas
• Criminal records
• Public services (Military service, Police, Civil service)
• Patents
• Birth, marriages and deaths
• Medical records

Possible information available from other sources:
• Credit references
• Bank accounts
• Mobile telephone and land line records
• Car insurance
• Web access and email

Now imagine that the totality of this information can be searched and browsed as if there
were no departmental, computational or legislative barriers to such use.

In such an environment it would be relatively easy for a revenue investigator to find all
those people who own a Porsche less than two years old, but have incomes of less than
£10000 per annum and are registered at an address where the aggregate income of the
household is less than £25000.

Once such people are found we can then start to browse and drill down into information
about such people. For instance, we might be interested to find out if they have a criminal
record, the balance and movement of funds in their bank accounts or possibly even find
out the names of associates.

Within life science a similar number of diverse databases exist and it is a commercial
priority to make optimal use of these information resources in the search for new
medicines. Despite the huge investment in new experimental techniques, the number of
drugs approved by the Food and Drug Administration (FDA) has dropped from a peak of
53 in 1996 to only 24 in 2001. Whilst, according to Pharmaprojects, the number of drugs
in preclinical development has increased by 9% in the same period, R&D productivity is
not responding to the radical changes in processes and technology that this industry has
undergone. New research technologies such as combinatorial chemistry, high throughput
screening, genomics, proteomics, pharmacogenomics and expression profiling have led
to vast increases in both the volumes of data and the number of different data types.
Rather than finding the needle in the haystack, pharmaceutical companies are simply
adding more hay. The keys to success lie not only in new experimental techniques and the
industrialization of research processes but in utilising existing data more effectively to
make better decisions that are likely to yield profitable drugs rather than costly failures.

As an example, information about a particular protein may be held in protein sequence,
crystallographic, high throughput screening and medical databases. The ability to search
for a particular protein using all of the information available remains problematic for a
number of reasons.

Prior Art 

Tools now exist that can integrate the existing diverse data sources and computational
engines within a unifying middleware layer. IBM's DiscoveryLink, MetaMatrix's
MetaBase modeller and GeneticXchange's Discovery Hub all provide a means by which
queries may be composed in SQL, or other query languages, and the query then run
against all data sources that have been integrated at the middleware layer. Using such
tools it is theoretically possible for a scientist to write a single query using SQL that uses
information from chemical, genetic, pharmacological and medical databases.

However, the value of this middleware data integration is reduced since scientists lack the
skills required to write queries in the available query languages (e.g., SQL) and also lack
the detailed knowledge of the underlying database architectures and schemas. In fact the
knowledge of the underlying databases tends to be broadly spread across the IT
organisation supporting the scientists. It would be very hard for a single seasoned IT
professional to write a cross-domain query for the scientist.

As these unifying middleware solutions have emerged, tools that integrate data at the
presentation layer have also been developed. For instance, LION bioscience's
DiscoveryCenter provides a way of collating data sets so that a scientist can view all the
information available about a particular data entity in a single screen regardless of the
source of that data. In addition such tools provide a mechanism for hyper-linking these
collated data sets so that the scientist can navigate from a screen about one entity to
screens showing data on related entities. For instance, a collated data set about a
particular protein entity provides hyperlinks to a collated data set for a gene that codes for
that protein, to an assay that is associated with the protein, or to a disease in which the
protein plays a part in the aetiology.

collating data from more than one database in a format which can be readily understood
and used by the user of the database system.

One prior art publication which does describe a computer system with an interface to
more than one database is US Patent No. US-B-6 236 328 (Coden et al, assigned to IBM).
This teaches an object-oriented query model for querying multiple databases. The
computer system has a plurality of base query objects defined, whereby each one of the
base query objects is capable of querying a specific database. These can be combined
together to query the databases.

US Patent Application Publication No. US-A-2002/0032675 also describes an 
information retrieval system for retrieving information from multiple information sources. 
The system works by building dynamic queries through the use of so-called query 
channels. A query channel permits the parsing of attributes of the search results between 
different queries. 

Because of the growth in biological, medical and pharmaceutical knowledge the data in
databases can rapidly change. Two prior art publications are known which describe
systems in which dynamic querying can be carried out. US Patent No. US-A-5 421 008
teaches a method, system and program providing graphical queries. Tables and lists are
configured from a database to define a common data structure. Dynamic data structures
are employed based on the information entered by a user to define various relationships
between the dynamic data and the database information. In US Patent Application
Publication No. US-A-2002/0123984 (Prakash) a framework for the creation and
execution of a dynamic query is taught. Graphical user interface screens allow a user
to create leaf conditions (or expressions) and logically join the leaf conditions into more
complex conditions. These conditions are then joined into a query. However, neither of
these publications teaches a system in which the querying is carried out on medical
databases, nor do they teach how links between data can be followed to create a better
understanding of the data.

Databases for storing biological, medical and pharmaceutical data are also known in the
art. For example, PCT Publication No. WO-A-02/054187 (Leveque et al, assigned to
Scientia, Inc.) teaches a database for collecting and managing clinical information. The
database includes the aggregation, anonymisation, analysis and dissemination of
information including the susceptibility, progression, severity of disease, the resources
utilised to treat the diseases, the quality of patient life, the ability to participate in the
workforce and survival. The information provides an understanding about why
genetically similar or identical patients express diseases differently.

Similarly, US Patent Application Publication No. US-A-2002/0052671 (Fey et al) teaches
a genetic health data management system for collecting genetic screening and
demographic data from clients. The system stores the clients' data and DNA/genetic
material samples and processes and analyses genetic testing data in conjunction with
other relevant health data. The system allows the generation of custom reports and
maintains life-long health records.

One example of a system to allow the integration of multiple data sources in life science
applications is described in US Patent Application Publication No. US-A-2002/0156756
(Stanley et al, assigned to Biosentients, Inc.). In this publication, methods are described to
define and describe a specific embodiment of architecture for a so-called Intelligent
Object data structure. The Intelligent Objects contain hierarchical, multi-layered property
panes for unified user presentation and functional interactivity, as well as components and
access interfaces to provide data status management, self-organising data and parallel
data-to-data information interchange and processing.

All of the approaches described in the prior art define the query with respect to raw
databases and tables. In contrast thereto, this invention describes a method for defining a
query on the collated data sets, rather than the tables and fields of multiple databases.

Summary of the Invention 



Although the prior art solutions offer some solutions to the problems of querying multiple
data sources, there is a need to simplify the process to enable researchers without detailed
knowledge of the individual database structures to perform meaningful ad-hoc searches.

There is furthermore a need to isolate searches from changes in the schemas of raw data
sources.

There is furthermore a need to express a database query graphically. 

These and other objects of the invention are solved by providing a method for searching
at least one data source using a query and comprising the following steps: a first step of
generating a plurality of query templates, in which each of the query templates can be
used to define at least a part of the query; a second step of logically joining at least some
of said plurality of query templates to create a query representation; a third step of
inputting or selecting input variables into data entry fields of the query representation; a
fourth step of selecting data elements that will be returned by the query; a fifth step of
generating the query using the query representation, input variables and the data elements
to be returned; a sixth step of sending said query to the at least one data source; a seventh
step of returning source results generated using the query from the at least one data
source; an eighth step of generating a reference to a collated data set for each of the
source results; and a ninth step of selecting one of the source results.

A collated data set in this application refers to a data set about a particular entity in which
a number of related source results from one or more data sources are collected together.
Users familiar with the content and hyperlinked structure of collated data sets will
naturally wish to perform searches which return collated data sets that conform to a set of
criteria. For such users, it is therefore entirely logical to allow them to define a search in
the context of the collated data sets. Once the users have defined a search in this way, the
system performs a search that returns references to matching collated data sets. Although
this method allows a user to define a query in the context of collated data sets, such
collated data sets do not need to be created on the fly or stored in the system in any
manner during the searching process. This drastically reduces the memory requirements
and the searching time, and ensures that search results accurately reflect the content of
the raw databases.

The collated data sets will be generally referred to as dossiers, but the choice of this term
is not to be limiting. The method of the invention also includes a prior step of defining
how to generate one or more types of collated data set, each collated data set
comprising one or more reports.

The query templates used in the method are related to the report definitions. So the
system implementing the method can use all or part of the report definitions in order to
construct the query. Similarly, the query representation is defined with reference to the
reports. Generation of the query template can be done automatically from the report
definition once the report definition has been created by the administrator or user of the
system. This ensures that the results displayed in the report are accurate.

In one embodiment of the invention, a search is carried out for matching dossiers using
the query representation. This can be done, for example, by directly searching the data
sources from which the dossiers are derived. Alternatively or additionally, this is carried
out by comparing other ones of the dossiers that are indirectly or directly referenced by
the dossier being matched.

Advantageously, the method further comprises a step of hyperlinking from an element of 
the displayed dossier to another dossier. This allows the user 10 to easily navigate from 
one element of the dossier to another relevant dossier. 

The dossier definition includes one or more instances of a report definition. Each report 
definition includes a retrieval definition to define how the members may be retrieved 
from a data source, a display definition defining how the results may be displayed, and an 
access definition for defining the permitted access to an instance of the report definition. 
Furthermore, the dossier definition can additionally include one or more dossier reference 
definitions in which the dossier reference definition defines a link between at least one 
instance of the report definition and at least one instance of a dossier definition. So, for
example, the dossier reference definition can define how a hyperlink is created between 
an element of a report and the same or another dossier. 

In an advantageous embodiment of the invention, the method further includes a step of
taking the retrieval definition and inverting said retrieval definition to create a search
definition. This search definition can be used in the construction of a query that will
retrieve source results. Similarly the method also includes a step of taking the display
definition and inverting said display definition to create a template form.

A further embodiment of the invention includes a step of creating a dossier linkage
relationship between an element in a template form with a corresponding one of the
dossier definitions. This dossier linkage relationship is created using the dossier reference
definition in a corresponding one of the report definitions.
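The inversion of a retrieval definition into a search definition can be illustrated with a minimal sketch. It assumes, purely for illustration, that a retrieval definition names a table, a key column holding the entity identifier, and the report fields; the class, function, table and column names below are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RetrievalDefinition:
    """Hypothetical retrieval definition: how a report is fetched for one entity."""
    table: str                 # assumed source table
    key_column: str            # column holding the entity identifier
    fields: Tuple[str, ...]    # columns shown in the report


def retrieval_sql(definition: RetrievalDefinition) -> str:
    """Forward direction: entity identifier in, report fields out."""
    columns = ", ".join(definition.fields)
    return f"SELECT {columns} FROM {definition.table} WHERE {definition.key_column} = ?"


def search_sql(definition: RetrievalDefinition, field: str) -> str:
    """Inverted direction: a report field value in, entity identifiers out."""
    if field not in definition.fields:
        raise ValueError(f"{field!r} is not a field of this report")
    return (f"SELECT DISTINCT {definition.key_column} "
            f"FROM {definition.table} WHERE {field} = ?")


protein_target = RetrievalDefinition("assay_target", "assay_id", ("target_name", "species"))
forward = retrieval_sql(protein_target)
inverted = search_sql(protein_target, "target_name")
```

The same definition drives both directions, which is one way the search results could be kept consistent with what the report displays.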

In the invention, the step of creating a query representation comprises a step of
assembling a query structure in which said plurality of query templates are joined using
nesting and/or Boolean logic. The query representation layer is then constructed by using
a plurality of query templates logically joined together using Boolean logic. The query
representation can also be constructed with a plurality of nested query representation
layers. The nested query representation layers may be used to check the content of
dossiers directly or indirectly referenced by the matching dossier.
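A minimal sketch of joining templates with Boolean logic and nesting is given below. It is an illustration under stated assumptions, not the specification's implementation: the `Template` and `Layer` classes and the column names are hypothetical, and nesting is shown only as parenthesised grouping. Mapping a nested layer onto a referenced dossier would need a join or subquery, which is omitted here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Template:
    """A filled-in query template: an SQL fragment with one placeholder."""
    sql: str
    value: object


@dataclass
class Layer:
    """A query representation layer: children joined by one Boolean operator."""
    op: str
    children: List[Union[Template, "Layer"]] = field(default_factory=list)


def render(node) -> Tuple[str, list]:
    """Return (condition, parameters) for a template or a nested layer."""
    if isinstance(node, Template):
        return node.sql, [node.value]
    if node.op not in ("AND", "OR"):
        raise ValueError(f"unsupported operator: {node.op!r}")
    if not node.children:
        raise ValueError("a layer needs at least one child")
    parts, params = [], []
    for child in node.children:
        sql, values = render(child)   # depth-first, so parameter order matches
        parts.append(sql)
        params.extend(values)
    return "(" + f" {node.op} ".join(parts) + ")", params


query = Layer("AND", [
    Template("target_name = ?", "kinase"),
    Layer("OR", [Template("ic50 < ?", 10.0), Template("assay_type = ?", "binding")]),
])
condition, parameters = render(query)
```

Values are kept as bound parameters rather than pasted into the condition text, so user input never becomes part of the SQL string.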

In yet a further embodiment of the invention, the method further comprises a step of
creating a plurality of query context subsets of the plurality of query templates. Each
query context subset contains query templates that are associated with a single dossier
definition through report template mapping. Each query representation layer may only
contain query templates from a single query context subset.

Access restrictions in the method can be controlled by creating a plurality of access
control subsets of the plurality of query templates. Each access control subset contains
only those query templates that may be accessed by a particular user of the system.
Retrieval of query representations is blocked for those query representations which
contain query templates not belonging to the access control subset for the user.
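The access check described above amounts to a subset test. The following sketch assumes, for illustration only, that a stored query representation is a nested list whose strings are template names; the function and template names are hypothetical.

```python
def templates_in(representation):
    """Collect every template name used in a (possibly nested) representation."""
    names = set()
    for item in representation:
        if isinstance(item, str):
            names.add(item)                 # a template, identified by name
        else:
            names |= templates_in(item)     # a nested representation layer
    return names


def may_retrieve(representation, access_control_subset):
    """Allow retrieval only if every template used belongs to the user's subset."""
    return templates_in(representation) <= set(access_control_subset)


saved_query = ["protein_target", ["assay_result", "batch_purity"]]
chemist_subset = {"protein_target", "assay_result", "batch_purity"}
visitor_subset = {"protein_target", "assay_result"}
```

Here `may_retrieve(saved_query, chemist_subset)` is true, while the visitor is blocked because `batch_purity` lies outside the visitor's subset.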

A further embodiment of the method includes a step of creating a dossier linkage subset
having one or more dossier definitions. Each member of the dossier linkage subset has a
dossier linkage relationship with the said selected one of the plurality of query templates.
The so-called context of the query representation layer is established by selecting a
member of said dossier linkage subset.

The method also includes a step of nesting a retrieved query representation below a
selected one of the plurality of query templates in the current query representation. The

set under the scrutiny of the matching algorithm. This method takes such a query
representation and generates a database query that may be performed against the raw data
sources or those data sources unified in a middleware layer. Once the results are
returned, the method converts these into references to a set of collated data sets that meet
the constraints of the query representation. The method therefore does not generate and
search all collated data sets, as this would be computationally grossly inefficient. As such
this patent describes a method for graphically querying virtual collated data sets.
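The conversion of source results into references, without building any collated data set, can be sketched as follows. The row layout (a mapping with an `entity_id` key), the function name and the content type and data source labels are assumptions made for the example.

```python
def to_dossier_references(source_results, content_type, data_source):
    """Turn raw result rows into dossier references; no dossier is materialised."""
    seen = set()
    references = []
    for row in source_results:
        entity_id = row["entity_id"]        # assumed column carrying the identifier
        if entity_id in seen:
            continue                        # several rows may describe one entity
        seen.add(entity_id)
        references.append(f"URN:{content_type}:{data_source}:{entity_id}")
    return references


rows = [{"entity_id": "A17"}, {"entity_id": "A42"}, {"entity_id": "A17"}]
refs = to_dossier_references(rows, "x-assay", "ASSAYDB")
```

Only the identifiers travel back to the user; a dossier would be generated later, and only for the reference the user selects.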

Whilst the description above uses examples from life science research, similar
problems occur in many other industries. Our method provides both a powerful way of
expressing a query graphically and also provides a layer of abstraction from the raw
databases. A user of the system and method, such as a researcher or a knowledge-worker,
will be able to perform complex and highly flexible queries with neither the knowledge of
traditional query languages nor knowledge of the underlying database architectures and
schemas.

Description of the Drawings

Fig. 1 shows a searching system
Fig. 2 shows a diagrammatic representation of a set of dossiers
Fig. 3 shows a report and its associated template
Fig. 4 shows another report and its associated template
Fig. 5 shows a further report and its associated template
Fig. 6 shows an assay report
Fig. 7 shows an assay template
Fig. 8 shows the relationship between dossiers and query representations using reverse
linking
Fig. 9 shows a dossier definition in XML
Fig. 10 shows the relationship between report definitions and templates
Fig. 11 shows the relationship between dossier definitions, URNs and dossiers
Fig. 12 shows the relationship between dossiers and query representations using
linking

Detailed Description of the Invention

on the molecule, results of quantum mechanics calculations, list of safety instructions, the
full IUPAC chemical name, patent information on the compound. From the point of view
of the user 10, putting all the information about a particular entity in one place (i.e. in the
dossier) has enormous benefits and saves much time spent collating the information
needed to make decisions.

Entries in the reports may contain hyperlinks to other dossiers. For instance one of the
reports within the molecule dossier would contain a report about the batches that have
been synthesised. We can create a hyperlink from the names of the batches to separate
batch dossiers on each batch. The batch dossier would contain reports on samples of each
batch, the amount, location, form, etc. The batch dossier might also contain a report on the
impurities in each batch or the spectra used for the structural determination.

The molecule dossier might also contain hyperlinks to an assay dossier. The assay
dossiers in turn contain hyperlinks to protein dossiers and the protein dossiers contain
links to gene dossiers, etc. In this way it is possible to build a network of hyperlinked
dossiers that provide the user 10 with a very powerful way to search and browse data.

The IT (Information Technology) groups within pharmaceutical and biotech companies
are responsible for defining the types of the dossiers available to the users 10 and also the
reports that the dossiers contain. For instance, assume that protein dossiers have been
created with reports relating to proteins. An administrator from the IT groups can now
add report definitions to this dossier type. Each report definition will take an
identification number for the protein of interest and generate specific reports for the
protein of interest. Other groups and companies can also create dossiers and reports.
These will be collectively termed dossier creators.

The workstation 30 is running an internet browser program such as Microsoft Internet
Explorer or Netscape Navigator. The dossiers are directly addressable in the same sense
that a web page is directly addressable through a URL (Uniform Resource Locator).
Specific dossiers are assigned and addressed through a URN (Uniform Resource Name)
and this makes it possible to send a reference to a dossier in an email.

The form of the URN follows a standard industry notation:

For instance a URN for a specific protein sequence held in SWISSPROT would look like:

URN:x-protein:SWISSPROT:P3845

On entry of the URN by the user 10 in the searching system 20, the searching system 20
has first to resolve which one of the dossier types needs to be created from the different
dossier types available. This is done on the basis of the content type, the data source 70 or
80, the release, the entity id, the version number, the security, the user 10 and the user's
10 preferences.
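As an illustration of the first resolution step, the sketch below splits the example URN into its parts and looks up a dossier type by content type. It assumes the three-field layout seen in the single example (content type, data source, entity id); the release and version fields mentioned in the text are not shown there and are left out. The registry and all names are hypothetical.

```python
from typing import NamedTuple


class DossierUrn(NamedTuple):
    content_type: str
    data_source: str
    entity_id: str


def parse_urn(urn: str) -> DossierUrn:
    """Split a URN of the assumed form URN:<content type>:<data source>:<entity id>."""
    parts = urn.split(":")
    if len(parts) != 4 or parts[0].upper() != "URN":
        raise ValueError(f"unrecognised dossier URN: {urn!r}")
    return DossierUrn(*parts[1:])


# Hypothetical registry mapping a content type to the dossier type to create.
DOSSIER_TYPES = {"x-protein": "protein dossier"}


def resolve_dossier_type(urn: str) -> str:
    """Pick the dossier type from the content type alone (a simplification)."""
    return DOSSIER_TYPES[parse_urn(urn).content_type]


example = parse_urn("URN:x-protein:SWISSPROT:P3845")
```

A fuller resolver would also weigh the security settings, the user and the user's preferences, as the paragraph above describes.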

Once the dossier type has been selected, the searching system 20 selects and then uses a
specific dossier definition to create a selected dossier. The dossier definition includes a
description of the reports that make up the dossier. The dossier definition includes both
references to child dossiers and reports. These dossiers and reports form the model part
of a model view controller (MVC) architecture.

A view of the selected dossier is then generated. One part of the view is used for
navigation between the specific reports. The other part of the view is the display of one
or more of those reports. One specific example of a view is a web page displayed in a
browser containing a navigation panel. Other navigation and viewing mechanisms may
also be created within this architecture, for instance a thick client application.

Dossier types rely heavily on inheritance. A new dossier type (a child dossier type) may
be created based upon an existing dossier type with one or two changes. This inheritance
mechanism allows the factoring out of the common parts of the different dossier types.
Furthermore dossier types may contain other dossier types in order to group report
definitions.

Reports used in the dossier can be of several types. Standard reports are read-only
constructs that merely present information and do not provide the user 10 with any
mechanism by which the user 10 can modify data. Actions are a special type of report
that allow active interaction between the user 10 and the data. For instance, whilst
viewing a molecule dossier, the user 10 might want to request that the molecule was sent
for further testing. Other actions might include simple sorting and filtering within a
spreadsheet, parameterisation and activation of a computational engine, annotation or data
entry into a database or data export to a client side program such as Excel.

Dossier definitions can be programmed using a variety of programming languages and
standards. In the preferred embodiment of the invention, the dossier definition is in the
form of an XML document. An example of such a dossier definition is given in Fig. 9,
which shows a dossier definition for a person containing a single report which shows the
name, job title, email, user id and department for a given individual. This example is only
one example of a dossier definition and more complex definitions may be created using
the features of XML.
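Fig. 9 itself is not reproduced in this text, so the sketch below only suggests what a person dossier definition with a single report might look like and how it could be read. Every element and attribute name, the `hr_db` source and the SQL statement are assumptions for illustration; they are not taken from Fig. 9.

```python
import xml.etree.ElementTree as ET

# Hypothetical dossier definition: one report with retrieval, display and access parts.
DOSSIER_XML = """
<dossier type="person">
  <report name="details">
    <retrieval source="hr_db">SELECT name, job_title, email, user_id, department
      FROM staff WHERE user_id = ?</retrieval>
    <display>
      <field>name</field>
      <field>job_title</field>
      <field>email</field>
      <field>user_id</field>
      <field>department</field>
    </display>
    <access role="employee"/>
  </report>
</dossier>
"""

root = ET.fromstring(DOSSIER_XML)
report = root.find("report")

dossier_type = root.get("type")
# The display definition lists which fields the report shows, in order.
display_fields = [f.text for f in report.find("display").findall("field")]
# The access definition names who may see this report.
required_role = report.find("access").get("role")
```

Keeping the retrieval, display and access parts side by side in one definition mirrors the three parts of a report definition described in the summary.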

Figure 2 shows a diagrammatic representation of a set of dossiers. In the central part of
the diagram there is a conceptual drawing of a dossier 200 on a particular kind of
biological assay, i.e. an assay dossier. The biological assays are often similar to
pregnancy test kits in that we are trying to detect some kind of colour change in the wells
of plastic plates after the introduction of a novel molecule. Pharmaceutical companies
may test tens of thousands of compounds in such assays every day.

In the example shown the assay relates to how novel molecules inhibit the action of an
enzyme that is responsible for the continuous division of tumour cells. We could similarly
have chosen an assay related to irritable bowel syndrome, obesity or asthma. In all of
these cases the structure of the assay dossier 200 would be very similar although
obviously the content of the reports would vary.

The assay dossier 200 that appears on the screen at the workstation 30 has two panels.
The left hand panel 210 shows a list of the reports 220 that are available in the assay
dossier 200. These include a description of the screen, a protocol describing how the
assay is performed, a list of similar assays, a list of all the results for molecules tested in the

Between life science dossiers:

From                To        Relationship
------------------  --------  ------------------------------------------------------
Protein             Gene      A protein is coded by a specified gene
Gene                Protein   A gene can code for a number of different proteins
Gene                Gene      Genes can link to other genes by virtue of sequence
                              homology
Protein             Protein   Proteins can link to other proteins by virtue of
                              sequence homology
Protein             Assay     A protein may have an associated assay(s) in which the
                              binding of drugs is measured
Assay               Protein   An assay may have a protein target
Protein             Pathway   A protein can be part of a metabolic or signal
                              transduction pathway
Pathway             Protein   A pathway may contain many different proteins
Assay               Batch     A batch of small molecule is tested in an assay
Batch               Assay     A batch of compound may be tested in multiple assays
Batch               Molecule  A batch contains a primary constituent and multiple
                              impurities
Molecule            Batch     A molecule can be synthesised multiple times in
                              different batches
Batch               Gene      A batch of compound can have an effect on the
                              expression of a particular gene
Molecule            Person    A molecule can be made by a particular chemist
Molecule            Protein   A molecule can be docked in silico into a protein
Experiment          Person    An assay can be performed by a person
Research Programme  Person    A research programme can be run by a particular person
Person              Task      A person can be assigned a series of tasks
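The relationships listed above form a directed network of dossier types. As a hedged illustration of how a user could navigate it, the sketch below encodes the From/To pairs and finds the shortest chain of hyperlinks between two dossier types with a breadth-first search. The `LINKS` dictionary and the function name are hypothetical; only the pairs themselves come from the table.

```python
from collections import deque

# From -> To pairs taken from the relationship table.
LINKS = {
    "Protein": ["Gene", "Protein", "Assay", "Pathway"],
    "Gene": ["Protein", "Gene"],
    "Assay": ["Protein", "Batch"],
    "Pathway": ["Protein"],
    "Batch": ["Assay", "Molecule", "Gene"],
    "Molecule": ["Batch", "Person", "Protein"],
    "Experiment": ["Person"],
    "Research Programme": ["Person"],
    "Person": ["Task"],
}


def shortest_link_path(start, goal):
    """Breadth-first search for the shortest chain of hyperlinks, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in LINKS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


route = shortest_link_path("Pathway", "Task")
```

A nested query representation that checks indirectly referenced dossiers would follow a chain of this kind, one link per nesting level.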

Access to the dossiers and/or reports within the dossiers can be restricted to certain users
10. The security mechanisms are based on and use existing security frameworks such as
Java Authentication and Authorisation Service (JAAS) and Lightweight Directory
Access Protocol (LDAP).

Based on username and password or other authentication mechanism, security may
restrict access to the following entities:

• The searching system 20
• Administrative Tools
• Specific Dossier Definitions
• Specific Report Definitions
• Specific Content Types
• Specific Repositories
• Specific Ids
• Specific Action Definitions

The searching system 20 supports all standard features such as groups and roles.

monitor and configure the searching system 20

in XML format

Actions of the user 10 are logged and can be examined by the user 10 or by
administrators. For instance, an entry is logged when a particular user 10 logs on or logs
off or when an operation is performed. Such logging would allow pay-per-view billing
systems to be built using dossiers and reports.



RleKo:nSC0U6/GB/]?l 4 July 2003 

Applicant brtcJUdDH lid 

Any URN can be specified as a favourite for a particular user 10 and the user can add a
comment and/or use this as a bookmark in the internet browser for rapid re-access of the
specific dossier. This facility is incorporated into the latest version of the internet
browsers.

The user 10 can define the visibility and accessibility of their list of favourites in the
internet browser. If they make one of the favourites public, all other members of their
organisation can see that they have an interest in that particular entity. Such mechanisms
promote collaboration over geographical and departmental boundaries.

Whenever something significant happens in the searching system 20, event messages can
be fired. Alert controllers examine the stream of event messages to determine whether or
not to take action based on them. This is done by comparing previously stored event
messages with new messages. One manner in which this is done is to prepare and store a
hash value of the old message and compare the stored hash value with a newly calculated
hash value. Examples of action might be notifying a list of interested users 10 that a
particular database 70 has been updated, notifying a user 10 that another user has made a
favourite of one of the items in the first user's list of favourites, or that one of a user's 10
favourites has been annotated or updated.
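The hash comparison can be sketched as below. The text does not name a hash function, so SHA-256 from the Python standard library is an assumption, as are the class name, the per-subject keying and the choice to treat the first message seen for a subject as a change.

```python
import hashlib


class AlertController:
    """Keeps one stored hash per subject and reports when a new message differs."""

    def __init__(self):
        self._stored = {}   # subject -> hex digest of the last message seen

    def has_changed(self, subject, message):
        digest = hashlib.sha256(message.encode("utf-8")).hexdigest()
        previous = self._stored.get(subject)
        self._stored[subject] = digest      # remember the newest message's hash
        # Assumption: no stored hash yet counts as a change worth an alert.
        return previous != digest


controller = AlertController()
first = controller.has_changed("database 70", "release 41")
repeat = controller.has_changed("database 70", "release 41")
update = controller.has_changed("database 70", "release 42")
```

Storing only the digest means the controller does not need to keep the full text of every earlier event message.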

Alerts may also be used by administrators to monitor the searching system 20. For
instance, notifying the administrator when a user 10 logs in from more than one machine
at the same time, when a password has been entered incorrectly three times in a row,
when a data source 70 has gone down or when a dossier definition was modified, etc.

The alert controllers decide to whom an alert message should be sent and the content of
the message. Each controller may use one or more methods for sending the message to
the user, e.g. by email, SMS message or flags in reports.

The searching system 20 can include a plurality of server computers 50 which can work
together. Specific ones of the dossiers and reports can be shared between individual ones
of the server computers 50. This allows reports to be shared between companies without
sharing the data sources 70 and 80 from which they are generated. Each company or user
10 is in control of their own security rules.

The searching system 20 includes a query constructor tool for searching for the dossiers
that meet a set of criteria. This can be better understood by considering the tool as
expressing queries within the context of the dossiers and reports, rather than the queries
being carried out on databases and tables within the databases. This enables an abstraction
from the language of IT into the language of science, as well as saving memory space as
discussed above. This means that the user 10 of the searching system 20 does not have to
learn a new language or understand the architecture of the corporate databases. There is
good anecdotal evidence to suggest that not one single person in the IT department of a
big pharmaceutical company understands all of these either.

To understand how this abstraction is achieved within the product it is necessary to 
understand that queries are built from interconnected logic elements and template objects. 
A template object can be described as a small data entry form into which a user can enter 
text, numeric data including ranges or more complex items such as a chemical 
substructure query. A template also has an associated piece of SQL code which, when 
combined with the data entered into the form part, becomes part of the clause in the 
WHERE part of the SQL query that is generated by the query constructor tool. 
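
A template object of this kind can be sketched as a small data structure. The following is only an illustration under stated assumptions: the class Template, its attribute names and the column name P.description are hypothetical, and the use of a bound "?" parameter instead of direct text substitution is a choice made for the example.

```python
from dataclasses import dataclass


@dataclass
class Template:
    """Hypothetical template: a labelled entry field plus an SQL fragment."""

    label: str          # caption of the data entry field
    sql_fragment: str   # SQL with a '?' placeholder for the entered value

    def where_clause(self, value):
        """Return the WHERE fragment and the parameters bound to it."""
        return f"({self.sql_fragment})", [value]


# The column name is an assumption made for this example.
description = Template("description", "P.description LIKE ?")
clause, params = description.where_clause("%guinea%")
```

Keeping the entered value separate from the SQL text means the fragment stored with the template never has to be edited when the user fills in the form.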

Template objects are associated with report definitions. Where possible all reports in the 
searching system 20 will have associated templates in the query constructor tool. This 
mapping also defines that every template is associated with a particular dossier item. Fig. 
3 shows this graphically. The upper part of Fig. 3 illustrates a protein target report 300 
such as that displayed in Fig. 2. The lower part of Fig. 3 illustrates a protein target 
template 310 that can be used to specify queries that return entities of type assay (it will 
be recalled that the protein target report appears in the assay dossier). 

Interestingly, one of the effects of this mapping is that as an administrator adds more and 
more reports to a particular dossier, the set of templates available in the query constructor 
tool also expands. The work performed by the administrator yields a double benefit. 

The operation of the query constructor tool will now be illustrated. Imagine for a moment 
that the query constructor tool could only manage a single template at a time (in fact the real 
power of the tool is far more extensive, as we shall see). 

The user 10 enters the variable "%guinea%" in the description field of the protein target 
template 310 and runs the query. The query constructor tool takes the value of this 
variable and inserts it into the SQL code already provided and stored in memory. The query 
constructor tool returns a list of URNs for all the protein dossiers in which the description 
field of the protein target report contains the word "guinea". The user 10 can now drill 
down to view the dossiers of all the proteins returned by the query. From each of these 
dossiers the user 10 can utilise hyperlinks to view dossiers of other types associated with 
these proteins as explained with reference to Fig. 2. 
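
The single-template query described above can be tried out against a throwaway in-memory database. This is a sketch under stated assumptions: the table protein_target, its columns, the URN values and the function run_template are all invented for the example, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical data standing in for the protein target report's source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein_target (urn TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO protein_target VALUES (?, ?)",
    [
        ("urn:X-protein:1", "Guinea pig ileum receptor"),
        ("urn:X-protein:2", "Human serotonin receptor"),
        ("urn:X-protein:3", "guinea pig histamine receptor"),
    ],
)


def run_template(connection, user_value):
    """Insert the user's value into the stored SQL and return matching URNs."""
    sql = "SELECT urn FROM protein_target WHERE description LIKE ? ORDER BY urn"
    return [row[0] for row in connection.execute(sql, (user_value,))]


urns = run_template(conn, "%guinea%")
```

Note that SQLite's LIKE is case-insensitive for ASCII text by default, so both "Guinea" and "guinea" match here; other database engines may behave differently.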

Fig. 4 shows another type of report and its corresponding template in the query 
constructor tool. In this figure, a structure report 400 within a small molecule dossier has a 2D 
representation of the molecular structure of the small molecule. The molecular structure 
may be furnished by an embedded Java chemical renderer in the searching system 20. 
Alternatively this could be generated as an image graphic file or using a plug-in for a 
browser. The template 410 in the query constructor tool has a corresponding chemical 
sketcher that allows users 10 to draw a structure and find similar molecules on the basis 
of substructure searching or chemical similarity in the data sources 70 and 80. 

A further report and its corresponding template are shown in Fig. 5. Fig. 5 shows a name 
value report 500 that returns a list of name value pairs. The number of pairs is unknown 
until the time the name value report 500 is run. Such name value reports 500 are typically 
used for calculated physical properties in lead optimisation and refinement structure 
activity analysis. The name value template 510 allows any number of name-value pairs to 
be used to define which dossiers are returned. 

As mentioned above, when the name value pair report 500 was created, the administrator 
would have had to write a piece of SQL code to generate the contents for the report. The 
query constructor tool can take this SQL code for the report and create "inverted" SQL 
for the template. This inverted SQL returns a set of URNs (identifiers of dossiers) 
identifying all those dossiers which match the query definition. The table below shows the 
SQL for the name value report 500 and the inverted SQL created automatically by the query 
constructor tool for the name value template 510. 

Report 

    SELECT B.Name AS Name, 
           A.IC50 AS IC50 
    FROM A, B, C 
    WHERE C.URI = {Urn} 
      AND A.Key = C.Key 
      AND B.Key = C.Key 

Template 

    SELECT C.URI AS URI 
    FROM A, B, C 
    WHERE (A.IC50 > 3 AND 
           A.LogP > 2 AND 
           A.MW < 500) AND 
          (A.Key = C.Key AND 
           B.Key = C.Key) 

It can be seen that more than one template can be used to specify a set of entities. The 
templates can be connected together using logic elements such as AND, OR, NOT etc. 
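
Connecting templates with logic elements can be pictured as a small expression tree that is rendered into a WHERE clause. The sketch below is an assumption-laden illustration: the classes Cond, And, Or and Not and the function render are hypothetical names, and the three conditions reuse the IC50, LogP and MW thresholds of the name value template only as sample data.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Cond:
    sql: str        # fragment contributed by one template, e.g. "A.IC50 > ?"
    value: object   # value the user entered in the template's field


@dataclass
class And:
    parts: List["Node"]


@dataclass
class Or:
    parts: List["Node"]


@dataclass
class Not:
    part: "Node"


Node = Union[Cond, And, Or, Not]


def render(node):
    """Render a tree of logic elements into (sql, parameters)."""
    if isinstance(node, Cond):
        return f"({node.sql})", [node.value]
    if isinstance(node, Not):
        sql, params = render(node.part)
        return f"(NOT {sql})", params
    joiner = " AND " if isinstance(node, And) else " OR "
    rendered = [render(part) for part in node.parts]
    sql = "(" + joiner.join(s for s, _ in rendered) + ")"
    params = [v for _, ps in rendered for v in ps]
    return sql, params


query = And([Cond("A.IC50 > ?", 3), Cond("A.LogP > ?", 2), Cond("A.MW < ?", 500)])
where_sql, where_params = render(query)
```

Because every node renders to a parenthesised fragment, elements can be nested to any depth without precedence surprises.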

At this stage of our discussion the set of entities to be returned can only be defined from 
the templates directly related with the dossier for that type of entity. This does not yet make 
use of the hyperlinks between dossiers, and as such limits the flexibility of the 
query constructor tool. 

The SQL code used to generate the assay report shown in Fig. 6 is as follows: 

SQL to generate the report 

    SELECT 
        AOB.assay_id     AS ASSAY_ID, 
        AOB.batch_regno  AS BATCH, 
        A.measurement    AS MEASURED, 
        AOB.assay_value  AS RESULT, 
        AOB.range        AS ERROR, 
        A.units          AS UNITS, 
        A.description    AS DESCRIPTION, 
        'urn:X-assay:ArrayDB:' || AOB.assay_id AS URI 
    FROM 
        ASSAYS_ON_BATCHES AOB, 
        BATCHES B, 
        ASSAYS A 
    WHERE 
        (AOB.batch_regno = B.batch_regno) 
        AND (B.main_component = {dbn}) 
        AND (AOB.assay_id = A.assay_id) 



Let us now examine what the assay template associated with this assay report would 
look like. This is shown in Fig. 7. In many respects it appears to be the same as that 
shown in Fig. 3. On the left of Fig. 7 we can see the familiar template 700 where each 
column has been mapped to a label 710 and an editable field 720. The entry in the 
editable field 720 can be used to select only those dossiers in which the associated report 
contains entries matching it. We can use the template to select batches of small 
molecules on the basis of their test results in various assays. To select the assays we can 
either use the editable field labelled "assay id" to specify a particular assay or use the 
templates from the assay dossiers to specify a particular subset of the assays in which the 
batch was tested. 

On the right of the figure we show a template 730 for the assay description, from the 
assay dossier definition, connected to the assay id field of the template 700. The square 
labelled "ASSAYS" 730 indicates that the context of the query has changed from 
returning batches to returning assays. 

The SQL code for generating this query is given below: 

SQL to generate a query with linked templates 

    SELECT 
        'urn:X-batch:[illegible]' || AOB.batch_id AS URI 
    FROM 
        ASSAYS_ON_BATCHES AOB, 
        BATCHES B, 
        ASSAYS A 
    WHERE 
        (AOB.batch_regno = B.batch_regno) AND 
        (AOB.assay_id = A.assay_id) AND 
        (AOB.assay_value < 40) AND 
        (AOB.assay_id IN 
            ( 
            select A.assay_id 
            from ASSAYS A 
            where 
                (A.description like '%Guinea%') 
            ) 
        ) 



This SQL code for the template contains nested queries: the outer query represents the 
SQL for the report template, and the nested part comes 
from a report associated with the assay dossier. 

In summary, we can state that the SQL code for the query 
constructor tool can be generated with the following three rules. 

• For each template perform an inversion of the SQL used to create the report such 
that the inverted SQL returns the set of entities that will contain matching entries 
in the reports of every element of the set. 

• Where we combine multiple templates with simple logic elements use the same 
logic elements to connect the inverted SQL together. 

• Where we combine templates that are related through hyperlinks in reports place 
the nested inverted SQL for the linked to object within the inverted SQL of the 
linked from object. 
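
The three rules can be illustrated with a short sketch that assembles nested inverted SQL and runs it. It does not attempt to derive inverted SQL automatically from report SQL; it only shows how already-inverted fragments fit together. The function names inverted_sql and nest, the lower-case table names and the sample rows are assumptions made for the example, and literal values are written inline purely for brevity.

```python
import sqlite3


def inverted_sql(select_id, tables, conditions):
    """Rules 1 and 2: return entity identifiers, joining conditions with AND."""
    where = " AND ".join(f"({c})" for c in conditions)
    return f"SELECT DISTINCT {select_id} FROM {', '.join(tables)} WHERE {where}"


def nest(link_column, inner_sql):
    """Rule 3: constrain a hyperlink column by the linked-to inverted SQL."""
    return f"{link_column} IN ({inner_sql})"


# Linked-to object: assays whose description matches.
assay_sql = inverted_sql("A.assay_id", ["assays A"],
                         ["A.description LIKE '%guinea%'"])

# Linked-from object: batches, with the assay query nested inside.
batch_sql = inverted_sql(
    "AOB.batch_id",
    ["assays_on_batches AOB"],
    ["AOB.assay_value < 40", nest("AOB.assay_id", assay_sql)],
)

# Hypothetical sample data to show the nested query selecting batches.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assays (assay_id INTEGER, description TEXT);
CREATE TABLE assays_on_batches (batch_id TEXT, assay_id INTEGER, assay_value REAL);
INSERT INTO assays VALUES (1, 'Guinea pig ileum');
INSERT INTO assays VALUES (2, 'Human liver');
INSERT INTO assays_on_batches VALUES ('B1', 1, 12.0);
INSERT INTO assays_on_batches VALUES ('B2', 1, 55.0);
INSERT INTO assays_on_batches VALUES ('B3', 2, 5.0);
""")
batches = sorted(row[0] for row in conn.execute(batch_sql))
```

Only batch B1 is returned: B2 fails the value threshold and B3 was tested in an assay that the nested query excludes.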

Now we will look at a typical cross domain query and see how the query is represented 
graphically. 

The query we use is an example taken from the IBM DiscoveryLink web site. 

"Show me all the compounds that have been tested against members of the serotonin 
family of receptors, have IC50 values in the nanomolar/ml range, a molecular weight 
between 375 and 425, and a logP between 4 and 5." 

To build a query representation for the above query the invention allows the user 10 to 
employ two different methods to connect templates from different query context subsets. 

Firstly we describe a forward reference method, in which we use the hyperlinks from a 
dossier that is being matched to referenced dossiers. Secondly, we describe an 
embodiment of the invention in which we use the hyperlinks from related dossiers to the 
dossier being matched. 

In Fig. 12 we can see the mapping of reports 1220 in the small molecule dossier 1200 to a 
set of templates 1230. These templates 1230 may be employed by the user 10 to select a 
set of references to small molecule dossiers. Note that the "assays run" report 1240 
contains cells with hyperlinks to high throughput screening dossiers. Therefore the assay 
run template 1250 can reference additional templates from the high throughput screen 
dossier definition such as the protein target template 1260. 

The protein target report 1270 in the high throughput screen dossier 1210 contains 
hyperlinks to a protein dossier and therefore we may use templates, such as the protein 
family template 1280, from the protein dossier definition to additionally constrain the 
selection of high throughput screens relevant to the search. 

Frequently, hyperlinks between dossiers are bi-directional and therefore another 
embodiment of this invention includes the use of reverse hyperlinks. Thus in the above 
example we would use links from a protein to a screen and from a screen to a small 
molecule. 

In Fig. 8, we can see the mapping from reports 800 to templates 810. Note also that a query 
for a set of small molecules is defined by two templates associated with the small 
molecule dossier combined with an AND logic element 860. 

In the description above we used the hyperlink from a report in the small molecule 
dossier 820 to link to the assay dossier 830. It will be realised that we will frequently have 
bidirectional links and that there are potentially hyperlinks from reports in the assay 
dossier 830 to small molecule dossiers 820. 

In Fig. 8 we have taken a template 840 that appears to be associated with a report 850 in 
the high throughput screen dossier 830 and used it to constrain the set of molecules 
returned. Note that the report 850 in the high throughput screen dossier 830 returns 
hyperlinks to small molecules and this is why it is logical for such so-called foreign 
templates to be permitted. 

The screening results template 840 in Fig. 8 means "for a given set of high throughput 
screens 830 return all the molecules that have been screened in these screens and meet the 
other criteria of the template". These molecules are then additionally filtered by the other 
templates 810 combined by the AND logic element 860. 

The given set of high throughput screens is defined in exactly the same way as any other 
entity. Therefore such foreign templates have the effect of changing the context of the 
search. 

In Fig. 8, the set of high throughput screens is constrained by a set of proteins associated 
with the screens and in turn the set of proteins is constrained by a template that defines a 
protein family. 

A further example of dossiers is shown in Figs. 10 and 11, which include person dossiers 
and project dossiers. In Fig. 10, the person dossier definition 1000 includes one or more 
instances (in this example three) of person report definitions 1020. Similarly a project 
dossier definition 1010 includes one or more instances of project report definitions 1030. 
Both the person report definitions 1020 and the project report definitions 1030 include a 
display definition for each report which instructs the workstation 30 how to display the 
person report and the project report respectively. The person report definitions 1020 and 
the project report definitions 1030 further include a retrieval definition in the form of an 
SQL statement to retrieve the data from the databases 70 or other data sources 80. 

The person report definition 1020 further includes a dossier reference definition 1040 that 
describes how hyperlinks may be constructed from person dossiers to project dossiers. 

As can be seen from Fig. 10, each of the instances of the person report definitions 1020 
has a corresponding one of the person template definitions 1050 as is discussed above. 
The person template definition 1050 includes a template form (discussed in connection 
with Fig. 8), a search definition and a dossier linkage relationship 1070 to indicate its 
relationship with one of a plurality of project template definitions 1060. A person query 
context subset is defined as all of those reports in the person dossier. 

The project template definitions 1060 are similarly related to the project report definitions 
1030 and have a template form, a search definition consisting of an SQL statement and a 
dossier linkage relationship 1070. 
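
The relationships between dossier definitions, report definitions and template definitions can be sketched as plain data structures. Everything named below (the classes, their fields, the helper functions and the SQL strings) is hypothetical and chosen only to mirror the terms used in the description; the display definition is reduced to a list of column names for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ReportDefinition:
    name: str
    columns: List[str]               # display definition, reduced to column names
    retrieval_sql: str               # retrieval definition
    links_to: Optional[str] = None   # dossier reference definition (target dossier)


@dataclass
class TemplateDefinition:
    report: str                           # report the template is mapped to
    fields: List[str]                     # template form: one entry field per column
    search_sql: str                       # search definition (inverted SQL)
    linked_dossier: Optional[str] = None  # dossier linkage relationship


@dataclass
class DossierDefinition:
    name: str
    reports: List[ReportDefinition] = field(default_factory=list)


def template_for(report: ReportDefinition, search_sql: str) -> TemplateDefinition:
    """Build the template definition that corresponds to one report definition."""
    return TemplateDefinition(report.name, list(report.columns),
                              search_sql, report.links_to)


def query_context_subset(dossier: DossierDefinition,
                         search_sql: Dict[str, str]) -> List[TemplateDefinition]:
    """Collect the templates mapped to the reports of a single dossier definition."""
    return [template_for(r, search_sql[r.name]) for r in dossier.reports]


person = DossierDefinition("person", [
    ReportDefinition("projects", ["project", "role"],
                     "SELECT project, role FROM person_projects WHERE person = ?",
                     links_to="project"),
    ReportDefinition("contact", ["email"],
                     "SELECT email FROM contacts WHERE person = ?"),
])
templates = query_context_subset(person, {
    "projects": "SELECT person FROM person_projects WHERE role LIKE ?",
    "contact": "SELECT person FROM contacts WHERE email LIKE ?",
})
```

Because each template carries the linkage of its report, a query constructor built on such structures could tell which templates are able to reference templates of another dossier definition.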

Fig. 11 shows another view of these relationships in which the same reference numerals 
are used to indicate the same objects as in Fig. 10. In this Figure, the person report 
definitions 1020 produce three instances of person dossiers 1120, each of which is 
accessed by one of the URNs 1100. Each of the instances of the person dossiers 1120 
refers to one person. The instances of the person dossiers 1120 include reports 1140 with 
hyperlinks 1110 to one or more project dossiers 1130. The project dossiers are defined by 
the project dossier definition 1010. As can be seen from this example, the top displayed 
instance of the person dossier 1120 has two of the hyperlinks 1110 which refer to two 
different ones of the project dossiers 1130. 

The foregoing is considered illustrative of the principles of the invention and since 
numerous modifications will occur to those skilled in the art, it is not intended to limit the 
invention to the exact construction and operation described. All suitable modifications 
and equivalents fall within the scope of the claims. 



File No. ITS Q01 ldfGB/Pl 
Applicant: Intellidos Ltd 



4 July 2003 



Claims 

1. Method for searching at least one data source using a query and comprising the 
following steps: 

- a first step of generating a plurality of query templates, where each query template 
can be used to define at least a part of the query; 

- a second step of logically joining at least some of said plurality of query templates 
to create a query representation; 

- a third step of inputting or selecting input variables into data entry fields of said 
query representation; 

- a fourth step for selecting data elements that will be returned by the query; 

- a fifth step of generating the query using the query representation, input variables 
and the data elements to be returned; 

- a sixth step of sending said query to the at least one data source; 

- a seventh step of returning source results generated using the query from the at least 
one data source; 

- an eighth step of generating a reference to a collated data set for each of the source 
results; and 

- a ninth step of selecting one of the source results. 



2. Method according to claim 1, further comprising a prior step of defining how to 
generate one or more types of collated data set (dossiers). 

3. Method according to claim 2, wherein [illegible] 
of the definition for generating a dossier. 

4. Method according to claim 1, wherein zero source results are returned in the eighth 
step. 

5. Method according to one of claims 1 or 2, wherein the query representation is defined 
with reference to the dossiers. 

6. Method according to any one of the above claims, further including: 

a step of searching for matching dossiers by means of the query representation. 

7. Method according to claim 6, wherein the step of searching for dossiers that match the 
query representation further includes a step of directly searching the data sources from 
which the dossiers are derived. 

8. Method according to claim 7, wherein the step of searching for matching dossiers 
includes a step of logically comparing other ones of the dossiers that are indirectly or 
directly referenced by the dossier being matched. 

9. The method of any one of the above claims, further comprising: 

- a tenth step of displaying a selected one of the source results as a displayed dossier. 

10. The method of claim 7, further comprising a step of hyperlinking from an element of 
the [illegible] 



11. The method of any one of the above claims 1 or 10 further comprising 

a step of defining the content of the dossiers using one or more dossier definitions. 

12. The method of claim 11, wherein the dossier definition includes one or more instances 
of a report definition, whereby each report definition includes a retrieval definition to 
define how the members may be retrieved from a data source, a display definition 
defining how the results may be displayed, and an access definition for defining the 
permitted access to an instance of the report definition. 

13. The method of claim 12, wherein the dossier definition further includes 
one or more dossier reference definitions, a dossier reference definition defining a link 
between at least one instance of the report definition and at least one instance of a 
dossier definition. 

14. The method of claim 12, wherein the dossier reference definition defines how a 
hyperlink is created between an element of a report and a dossier. 

15. The method according to any one of claims 12 to 14, further including a step of the 
creation of one or more report template mappings using one of the report definitions. 

16. Method according to any one of claims 12 to 15 further comprising a step of taking the 
retrieval definition and inverting said retrieval definition to create a search definition, 
the search definition being used in the construction of a query that will retrieve source 
results. 

17. Method according to any one of claims 12 to 15 further comprising a step of taking the 
display definition and inverting said display definition to create a template form. 

18. Method according to claim 17, wherein one or more columns in a tabular one of the 
report definitions are associated with one or more of the data entry fields in the 
associated one of the template forms. 

19. Method according to any one of claims 12 or 15, further comprising a step of creating a 
dossier linkage relationship between an element in a template form and a 
corresponding one of the dossier definitions. 

20. Method according to claim 19, wherein the said dossier linkage relationship is created 
using the dossier reference definition in a corresponding one of the report definitions. 

21. Method according to any one of claims 11 to 20, wherein the step of collating the 
selected source result further includes a step of selecting a dossier definition for use in 
collating the source data. 

22. Method according to any one of the above claims, wherein the step of creating a query 
representation comprises a step of assembling a query structure in which said plurality 
of query templates are joined using nesting and/or Boolean logic. 

23. Method according to claim 22, further including a step of constructing a query 
representation layer using a plurality of query templates logically joined together using 
Boolean logic. 

24. Method according to any one of the above claims, wherein the step of creating a query 
representation further comprises a step of creating a plurality of nested query 
representation layers, wherein the nested query representation layers may be used to 
check the content of dossiers directly or indirectly referenced by the matching dossier. 

25. Method according to any one of claims 11 to 24, further comprising a step of creating a 
plurality of query context subsets of the plurality of query templates wherein each query 
context subset contains query templates that are associated with a single dossier 
definition through report template mapping. 

26. Method according to any of the above claims, further comprising a step of creating a 
plurality of access control subsets of the plurality of query templates wherein each 
access control subset contains only those query templates that may be accessed by a 
particular user of the system. 

27. Method according to claim 23 of creating a query representation layer comprising a 
plurality of query templates belonging to a selected one of the plurality of query context 
subsets and also belonging to the access control subset. 

28. Method according to any of the above claims, further comprising a step of defining an 
initial query context of the query by selecting an initial one of the plurality of dossier 
definitions. 

29. Method according to claim 24 of creating an initial query representation layer. 

30. Method according to claim 29, further comprising a step for setting the query context of 
the initial query representation layer according to the initial query context. 

31. Method according to one of claims 27 to 30 for nesting one or more empty query 
representation layers below a selected one of a plurality of query templates in the query 
representation layer and of setting the query context of said empty query representation 
layers. 

32. Method according to one of claims 19 to 31, further comprising a step of creating a 
dossier linkage subset having one or more dossier definitions, wherein each member of 
the dossier linkage subset has a dossier linkage relationship with the said selected one 
of the plurality of query templates. 

33. Method according to claim 32, further comprising a step of setting the context of the 
query representation layer by selecting a member of said dossier linkage subset. 

34. Method according to any one of the above claims, further comprising a step of storing 
and retrieving sets of query representations. 

35. Method according to claim 34, further comprising a step of blocking retrieval of query 
representations which contain query templates not belonging to the access control 
subset for the user. 

36. Method according to one of claims 24 to 35, further comprising a step of nesting a 
retrieved query representation below a selected one of the plurality of query templates 
in the current query representation. 

37. Method according to claim 36, further comprising a step of ensuring that the initial 
context of the said retrieved query representation has a dossier linkage relationship with 
the said selected one of the plurality of query templates under which it will be nested. 

38. Method according to claims 24 to 37, further comprising a step of adding a query 
template with a nested empty query representation layer to the existing query 
representation. 

39. Method according to claim 38, further comprising a step of [illegible] the query 
representation layer into which the said query template and nested query representation 
layer will be added. 

40. Method according to any one of claims 38 and 39, further comprising a step of creating 
a template linkage subset of the plurality of query templates that have a linkage 
relationship to the query context of the said query representation layer. 

41. Method according to claims 38 to 40, further comprising a step of selecting one of the 
said plurality of query templates within the template linkage subset. 

42. Method according to any one of claims 38 to 41, further comprising a step of setting the 
query context of the said added query representation layer to be a selected one of the 
dossier definitions. 

43. Method according to claim 40, further comprising a step of creating a 
subset of dossier definitions wherein the said dossier definitions contain report 
definitions which have a report template mapping to the said selected query template. 

44. Method according to any of the above claims, wherein two or more of the at least one 
data sources have a different structure. 

45. Method according to any of the above claims, further comprising a step of exporting 
said results to an application program. 

46. Method according to any one of the above claims, further comprising a step of 
assembling one or more compound query templates from the plurality of query 
templates and using the compound query templates to create a query representation. 

47. Method according to any of the above claims, further comprising a step of presetting 
and/or hiding the contents of data entry fields in a query template. 

48. Method according to claims 46 or 47, further comprising a step of storing and 
retrieving compound templates. 

49. Use of the method of any of the above claims for generating the query relating to data in 
the life science industries. 

50. Use of the method of claim 49 for generating the query relating to biological, genetic, 
chemical or pharmacological data. 

51. System for searching data comprising: 

- a display device for displaying and editing a set of query templates with data 
entry fields; 

- an input device for inputting at least one variable into at least one of the set 
of query templates; 

- a query generation device to generate a query using the method of one of 
claims 1 to 50; and 

- a display device for viewing the query that will be sent to one or more 
databases. 

52. System according to claim 51 further comprising a selection device to select at least one 
element in the set of query templates to display at least one query template with data 
entry fields. 

53. System according to one of claims 51 or 52 further comprising a query database having 
a plurality of reports for display on the display device and a plurality of query 
templates, wherein each one of the plurality of query templates is linked to one of the 
plurality of reports. 

[...] fields in the query template. 

[...] representation into natural language. 

[...] science industries. 

62. System according to claim 61 [...] 
or pharmacological data. 

63. Database structure for use in a method for searching data sources having a plurality of 
dossier definitions, report definitions and a plurality of query templates having a 
plurality of input fields, wherein one of the plurality of report definitions is associated 
with one of the plurality of query templates. 

64. Database structure of claim 61, further comprising a set of access controls on each 
dossier definition and report definition. 

65. Database structure of claim 61 or 62, further comprising a display definition defining 
how to display results returned from the method for searching databases. 

66. The system and method for searching at least one data source as set forth in the 
description with reference to the drawings. 

Abstract 

A method and system for searching at least one data source using a query is described. 
In a first step one or more query templates are generated in which each query template 
can be used to define at least a part of the query. Several of the query templates are 
joined logically to create a query representation and input variables are selected or 
inputted into data entry fields. Subsequently, those data elements which will be returned 
by the query are selected and finally a query is generated which is sent to at least one data 
source. 



(Fig. 8) 